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Abstract 

Iterative imputation, in which variables are imputed one at a time each given a model 
predicting from all the others, is a popular technique that can be convenient and flexible, as it 
replaces a potentially difficult multivariate modeling problem with relatively simple univariate 
regressions. In this paper, we begin to characterize the stationary distributions of iterative 
imputations and their statistical properties. More precisely, when the conditional models are 
compatible (defined in the text) , we give a set of sufRcient conditions under which the imputation 
distribution converges in total variation to the posterior distribution of a Bayesian model. When 
the conditional models are incompatible but are valid, we show that the combined imputation 
estimator is consistent. 

1 Introduction 

Iterative imputation is a widely used approach for imputing multivariate missing data. The proce- 
dure starts by randomly imputing missing values using some simple stochastic algorithm. Missing 
values are then imputed one variable at a time, each conditionally on all the others using a model fit 
to the current iteration of the completed data. The variables are looped through until approximate 
convergence (as measured, for example, by the mixing of multiple chains). 

Iterative imputation can be an appealing way to express uncertainty about missing data. There 
is no need to explicitly construct a joint multivariate model of all types of variables: continuous, or- 
dinal, categorical, and so forth. Instead, one need only specify a sequence of families of conditional 
models such as linear regression, logistic regression, and other standard and already programmed 
forms. The distribution of the resulting imputations is implicitly defined as the invariant (sta- 
tionary) distribution of the Markov chain corresponding to the iterative fitting and imputation 
process. 

Iterative, or chained, imputation is convenient and flexible and has been implemented in vari- 
ous ways in several statistical software packages, including mice [28] and mi [26] in R, IVEware |16j 
in SAS, and ice in Stata [191 EO]- The popularity of these programs suggests that the resulting 
imputations are believed to be of practical value. However, the theoretical properties of iterative 
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imputation algorithms are not well understood. Even if, as we would prefer, the fitting of each 
imputation model and the imputations themselves are performed using conditional Bayesian infer- 
ence, the stationary distribution of the algorithm (if it exists) does not in general correspond to 
Bayesian inference on any specified multivariate distribution. 

Key questions are: (1) Under what conditions does the algorithm converge to a stationary dis- 
tribution? (2) What statistical properties does the procedure admit given that a unique stationary 
distribution exists? 

Regarding the first question, researchers have long known that the Markov chain may be non- 
recurrent ("blowing up" to infinity or drifting like a nonstationary random walk), even if each of 
the conditional models is fitted using a proper prior distribution. 

In this paper, we focus mostly on the second question — the characterization of the stationary 
distributions of the iterative imputation conditional on its existence. Unlike usual MCMC algo- 
rithms, which are designed in such a way that the invariant distribution and target distribution are 
identical, the invariant distribution of iterative imputation (even if it exists) is largely unknown. 

The analysis of iterative imputation is challenging for at least two reasons. First, the range of 
choices of conditional models is wide, and it would be difficult to provide a solution applicable to 
all situations. Second, the literature on Markov chains focuses on known transition distributions. 
With iterative imputation, the distributions for the imputations are known only within specified 
parametric families. For example, if a particular variable is to be updated conditional on all the 
others using logistic regression, the actual updating distribution depends on the logistic regression 
coefficients which are themselves estimated given the latest update of the missing values. 

The main contribution of this paper is to develop a mathematical framework under which the 
asymptotic properties of iterative imputation can be discussed. In particular, we demonstrate the 
following results. 

1. Given the existence of a unique invariant (stationary) distribution of the iterative imputation 
Markov chain, we provide a set of conditions under which this distribution converges in total 
variation to the posterior distribution of a joint Bayesian model, as the sample size tends 
to infinity. Under these conditions, iterative imputation is asymptotically equivalent to full 
Bayesian imputation using some joint model. Among these conditions, the most important is 
that the conditional models are compatible — that there exists a joint model whose conditional 
distributions are identical to the conditional models specified by the iterative imputation 
(Definition [T]) . We discuss in Section |3j 

2. We consider model compatibility as a typically necessary condition for the iterative imputation 



distribution to converge to the posterior distribution of some Bayesian model (Section 3.4) 



3. For incompatible models whose imputation distributions are generally different from any 
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Bayesian model, we show that the combined completed-data maximum UkeUhood estimate 
of the iterative imputation is a consistent estimator if the set of conditional models is valid, 
that is, if each conditional family contains the true probability distribution (Definition [s] in 
Section |4)). 

The analysis presented in this paper connects to the existing separate literatures on missing 
data imputation and Markov chain convergence. Standard textbooks on imputation inference are 
[m [22] , and some key papers are [lOl [3l [13l [HI EH [231 El] • Large sample properties are studied 
by [El [251 EH], small samples are by |3], and the issue of congeniality between the imputer's and 
analyst's models is considered by [T3]. 

Our asymptotic findings for compatible and incompatible models use results on convergence of 
Markov chains, a subject on which there is a vast literature on stability and rate of convergence 
(|H1 m [2])- In addition, empirical diagnostics of Markov chains have been suggested by many 
authors, for instance, [7]. For the analysis of compatible models, we need to construct a bound 
for convergence rate using renewal theory [H [TE[ I18j . which has the advantage of not assuming 
the existence of an invariant distribution, which is naturally yielded by the minorization and drift 
conditions. 

In Section [2] of this article, we lay out our notation and assumptions. We then briefly review the 
framework of iterative imputation and the Gibbs sampler. In Section |3j we investigate compatible 
conditional models. In Section |4j the discussion focuses on incompatible models. Section [5] includes 
several simulation examples. An appendix is attached containing the technical developments and 
a brief review of the literature for Markov chain convergence via renewal theory. 

2 Baclcground 

Consider a data set with n cases and p variables, where x = (xi, ...,Xp) represents the complete 
data and Xj = (xi^j, ...,Xn,i)'^ is the i-th. variable. Let be the vector of observed data indicators 
for variable i, equaling 1 for observed variables and for missing, and let x°^'^ and x™*** denote the 
observed and missing subsets of variable i: 

^obs ^ i^ofes .-^-^^^ _^^^|^ ^rms ^ . • ^ _^^|^ ^ = {r^ : i = 1, ...,p}. 

To facilitate our description of the procedures, we define 

= {xf ^ : i = 1, j - 1, j + 1, ...,p}, x™^ = {xr ■■i = h -J - 1, j + 1, ■:,?}■ 

We use boldface x to denote the entire data set and x to denote individual observations. Therefore, 
Xj denotes the j-th variable of one observation and X-j denotes all the variables except for the j'-th 
one. 
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Throughout, we assume that the missing data process is ignorable. One set of sufficient condi- 
tions for ignorabihty is that the rj process is missing at random and the parameter spaces for 
and X are distinct, with independent prior distributions |1H I22j. 

2.1 Inference using multiple imputations 

Multiple imputation is a convenient tool to handle incomplete data set by means of complete-data 
procedures. The framework consists of producing m copies of the imputed data and applying the 
users' complete data procedures to each of the multiply imputed data sets. Suppose that m copies 
of point estimates and variance estimates are obtained, denoted by (^(*\ C/*-*^), i = l,...,m. The 
next step is to combine them into a single point estimate and a single variance estimate {9m, Tm) 
If the imputed data are drawn from the joint posterior distribution of the missing data under 
a Bayesian model, under appropriate congeniality conditions, 9m is asymptotically equal to the 
posterior mean of 9 and Tm is asymptotically equal to the posterior variance of 9 (\2'2\\1'6\) . The large 
sample theory of Bayesian inference ensures that the posterior mean and variance are asymptotically 
equivalent to the maximum likelihood estimate and its variance based on the observed data alone 
(see [5]). Therefore, the combined estimator from imputed samples is efficient. Imputations can 
also be constructed and used under other inferential frameworks; for example, Robins and Wang 
[T71 129] propose estimates based on estimating equations and derive corresponding combining rules. 
For our purposes here, what is relevant is that the multiple imputations are being used to represent 
uncertainty about the joint distribution of missing values in a multivariate dataset. 

2.2 Bayesian modeling, imputation, and Gibbs sampling 

In Bayesian inference, multiply imputed data sets are treated as samples from the posterior dis- 
tribution of the full (incompletely-observed) data matrix. In the parametric Bayesian approach, 
one specifies a family of distributions /(x|0) and a prior vr(0) and then performs inference using 
i.i.d. samples from the posterior predictive distribution, 

p(x™'^|x°^") = / f{^"'''\ji.°'",9)p{9\x."^')d9, (1) 
Je 

where p(^|x) oc 7r{9)f{x.\9). Direct simulation from ([T]) is generally difficult. One standard solu- 
tion is to draw approximate samples using the Gibbs sampler or some more complicated Markov 
chain Monte Carlo (MCMC) algorithm. In the scenario of missing data, one can use the "data 
augmentation" strategy to iteratively draw 9 given (x"***, x™*^) and x™*^ given (x"^'', 9). Under reg- 
ularity conditions (positive recurrence, irreducibility, and aperiodicity; see [8]), the Markov process 
is ergodic with limiting distribution p(x'"*'', 9\x.°^'^). 

In order to connect these results to the iterative imputation that is the subject of the present ar- 
ticle, we consider a slightly different Gibbs scheme which consists of indefinite iteration of following 
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p steps:, 

Step 1. Draw 6* ~ J9(i9|xf ^ x_i) and xf*'^^ - /(xf ''^^Ixf ^ x_i, i9); 
Step 2. Draw 9 ~ p(0|xf^x_2) and x^*^^ ~ /(x^^^*|xf ^ x_2, 0); 

Step p. Draw ^ - p(^|x;''^ x_p) and x^'*^ - /(x^'^^jx^^*, x_p, 

At each step, the posterior distribution is based on the updated values of the parameters and 
imputed data. It is not hard to verify that the Markov chain evolving according to steps 1 to 
p (under mild regularity conditions) converges to the posterior distribution of the corresponding 
Bayesian model. 

2.3 Iterative imputation and compatibility 

For iterative imputation, we need to specify p conditional models, 

for 9j G @j with prior distributions iTj{Oj) for j = 1, ...,p. Iterative imputation adopts the following 
scheme to construct a Markov chain, 

Step 1. Draw 9i from pi(^i|x°^*, x_i), which is the posterior distribution associated with gi and 
TTi; draw x^**^ from 5fi(X]"***|x°''*, x_i, ^i); 

Step 2. Draw 02 from ^2(^21x2'"', x_2), which is the posterior distribution associated with 52 and 
7r2; draw x^^** from c/2(x^^^^|xf ^ x_2, 6I2); 

Step p. Draw Op from pp(^p|x°''*, x_p), which is the posterior distribution associated with gp and 
TTp; draw x^^*^ from ^?p(x^'^^|x°^^ x_p, ^p). 

Iterative imputation has the practical advantage that, at each step, one only needs to set up a 
sensible regression model of x^- given X-^-. This substantially reduces the modeling task, given that 
there arc usually standard linear or generalized linear models for univariate responses of different 
variable types. In contrast, full Bayesian (or likelihood) modeling requires the more difficTilt task 
of constructing a joint model for x. Whether it is preferable to perform p easy task or one difficult 
task, depends on the problem at hand. All that is needed here is the recognition that, in some 
settings, users prefer the p easy steps of iterative imputation. 

But iterative imputation has conceptual problems. Except in some special cases, there will 
not in general exist a joint distribution of x such that /(xj|x_j,^) = gj(xj\x-j,6j) for each j. In 
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addition, it is unclear whether the Markov process has a probabihty invariant distribution; if there 
is such a distribution, it lacks characterization. 

In this paper, we discuss the properties of the stationary distribution of the iterative imputation 
process by first classifying the set of conditional models as compatible (defined as there existing a 
joint model / which is consistent with all the conditional models) or incompatible. 



We refer to the Markov chain generated by the scheme in Section 2.2 as the Gibbs chain and 
that generated by the scheme in Section [2.3| as the iterative chain. Our central analysis works by 
coupling the two. 

3 Compatible conditional models 
3.1 Model compatibility 

Analysis of iterative imputation is particularly challenging partly because of the large collection 
of possible choices of conditional models. We begin by considering a restricted class, compatible 
conditional models, defined as follows: 

Definition 1 A set of conditional models {gj{xj\x-j,9j) : 6j E Qj,j = is said to be 

compatible if there exists a joint model {f{x\9) : 9 € 0} and a collection of surjective maps, 
{tj : 9 — > 9j : j = 1, ...,p} such that for each j, 9j G &j, and 6 G tj^{9j) = {9 : tj{9) = 9j}, 

9j{xj\x-j,9j) = f{xj\x^j,9). 

Otherwise, {gj : j = l,...,p} is said to be incompatible. 

Though imposing certain restrictions, compatible models do include quite a collection of procedures 
practically in use (e.g. ice in Stata). In what follows, we give a few examples of compatible and 
incompatible conditional models. 

We begin with a simple linear model, which we shall revisit in Section [5] 

Example 1 (bivariate Gaussian) Consider a binary continuous variable {x, y) and conditional 
models 

These two conditional models are compatible if and only if {/3x\y, /3x\y,Tx,Ty) lie on a subspace 
determined from the joint model, 

M-ivff ^O^^]"), whereJ:=( ^^^^ 

with CxjCTy > and p £ [—1,1]. The reparameterization from {fJ-x, PyjO'xjO'y, p) to the parameters 
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of the conditional models is: 

( 2 2\ / o 2\ / P^x X / 2 \ 2 

ti[^ix,a^,fj.y,ay,p) = [ax\y,Px\y,Tx) = [iJ-x , (1 - /> 

\ ay ay 

J- t 2 2 \ t a 2\ ( P'^y P^y t-t 2\ 2\ 
nyP'X^CFxiPyj'^y^P) = \(^y\x,Py\x,Ty) = I % , (1-/3 )ay\. 

^ X ^ X 

The following example is a natural extension. 

Example 2 (continuous data) Consider a set of conditional linear models: for each j, 

Xj\x-j,l3j,a'j ~ N {{l,x-j)l3j,a'j) , 

where Pj is a p x 1 vector, 1 = (1, l)""". Consider the joint model of {xi, ...,Xp) N{fi,T,). 
Then the conditional distribution of each xj given x^j is Gaussian. The maps tj 's can be derived 
by conditional multivariate Caussian calculations. 

Example 3 (continuous and binary data) Let xi be a Bernoulli random variable and X2 be a 
continuous random variable. The conditional models are as follows: 

xi\x2 ~ Bernoulli ^a+j3x2 ) ' ^^ki ~ N{I3q + /3ixi,o-^). 
The above conditional models are compatible with the following joint model: 

xi ~ Bernoulli{p), X2\xi ~ 

If we let 

ti(p,/3o,^i,(j2) = - ^) = 

t2{p,Po,Pi,(r^) = (/3o,/3i), 

the conditional models and this joint model are compatible with each other. Similarly compatible 
models can be defined for other natural exponential families. See [7^ . 

Example 4 (incompatible Gaussian conditionals) There are many incompatible conditional 
models. For instance, 

x|2/~ A^(/3ly + /32y^l), y|x ~ iV(Aix, 1), 
are compatible only if 1^2 = 0. 

3.2 Total variation distance between two transition kernels 

Let {x™*'i(A;) : k e Z+} be the Gibbs chain and {x™*"'2(A:) : k e Z+} be the iterative chain. Both 
chains live on the space of the missing data. We write the completed data as x*(A;) = (x™^'*(/c), x"^**) 
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for the Gibbs chain (i = 1) and the iterative chain (i = 2). The transition kernels are 

Ki{w, dw') = P(x™^'^(A; + 1) G du;'|x™'''(^) = w), foi i = 1, 2. (2) 

where w is a generic notation for the state of the processes. The transition kernels {Ki and K2) 
depend on x°^^. For simplicity, we omit the index of the notation of Ki. Also, we let 

Kf)(z.,^)^P,(x™^'^(fc) e A), 

for x™'*'*(0) ^ V, V being some starting distribution. The probability measure Py also depends on 
x°'"'. Let dxv denote the total variation distance between two measures, that is, for two measures, 
vi and j^2) defined on the same probability space 

dTv{i^i:V2) = sup \vi{A) -U2{A)\. 



We further define 



\v\y = sup / h[x)v{dx) 

\h\<V J 



\h\<V. 

and ||z^||i = ll^^lly for V = 1. Let 1/^°''" be the stationary distribution of K^. We intend to establish 
conditions under which 

, - v-obs 'v'obs . 

drvii^f )^0 

in probability as n — )• 00 and thus the iterative imputation and the joint Bayesian imputation are 
asymptotically the same. 

Our basic strategy for analyzing the compatible conditional models is to first establish that 
the transition kernels Ki and K2 are close to each other in a large region An (depending on the 
observed data x"***), that is, \\Ki{w^-) — K2{'w,-)\\\ — )• as n — cxd for if G A„; and, second, to 
show that the two stationary distributions are close to each other in total variation in that the 
stationary distributions are completely determined by the transition kernels. In this subsection, we 
start with the first step, that is, to show that Ki converges to K2. 

Both the Gibbs chain and the iterative chain evolve by updating each missing variable from 
the corresponding posterior predictive distributions. Upon comparing the difference between the 



two transition kernels associated with the simulation schemes in Sections |2.2| and 2.3, it suffices to 
compare the following posterior predictive distributions (for each j = 1 , . . . , p) , 

/(x™«|xf ,x„,) = y/(xf^|xf ,x_„%(0|xf ,x_,)(i0 (3) 

5,(x™-|xf ,x_,) = |5i(xrixf ,x_„0,)p,(0,|xf ,x„,)d^„ (4) 

where p and pj denote the posterior distributions under / and gj respectively. Due to compatibility. 
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the distributions of the missing data given the parameters are the same for the joint Bayesian model 
and the iterative imputation model: 

if tj{9) = 9j. The only difference lies in their posterior distributions. In fact, the || • ||i distance 
between two posterior predictive distributions is bounded by the distance between the posterior 
distributions of parameters. Therefore, we move to comparing p(0|x^'''', x_j) and pj(^j|x^'"', x_j). 

Parameter augmentation. Upon comparing the posterior distributions of 9 and 9j, the first 
disparity to reconcile is that the dimensions are usually different. Typically 9j is of a lower dimen- 
sion. Consider the linear model in Example [l} The conditional models include three parameters 
(two regression coefficients and variance of the errors), while the joint model has five parameters 
(/Ux, /^y, crx,(Ty, p). This is because the (conditional) regression models are usually conditional on the 
covariates. The joint model not only parameterizes the conditional distributions of Xj given x_j 
but also the marginal distribution of x_j. Therefore, it includes extra parameters, although the 
distributions of the missing data is independent of these parameters. We augment the parameter 
space of the iterative imputation to {9j, 9*) with the corresponding map 9j = t*j{9). The augmented 
parameter {9j,9*) is a non-degenerated reparameterization of 9, that is, Tj(9) = {tj{9),t*{9)) is a 
one-to-one (invertible) map. 

To illustrate this parameter augmentation, we consider the linear model in Example [T] in which 
^ = {fJ'x,o''^, fJ-y^cTy, p), where we use px and cj^ to denote mean and variance of the first variable, 
Py and ay to denote the mean and variance of the second, and p to denote the correlation. The 
reparameterization is, 

92 = t2{px,crl,py,al,p) = {ay\x,Py\x,ry) = {py - —px,—,^^ - P^)'^y), 

^x fx 

9*2 = t*2{px,al,py,al,p) = {px,al). 

The function t2 maps to the regression coefficients and the variance of the residuals; ^2 maps to the 
marginal mean and variance of x. Similarly, we can define the map of ti and t^. 

Impact of the prior distribution. Because we are assuming compatibility, we can drop the 
notation gj for conditional model of the j-th variable. Instead, we unify the notation to that of 
the joint Bayesian model /(xj|x_j, 9). In addition, we abuse the notation and write /(xj|x_j, 9j) = 
f{xj\x^j,9) ioTtj{9) = 9j. For instance, inExample[l| we write /(y|x, a^i^, /3y|x, CTyi^,) = f{y\x,px,Py,crx,cry,p) 
as long as ay\x = /^y - ^/^x, /3y|x = and a'^^^ = (1 - p^)cr^. 

The prior distribution tt on 6* for the joint Bayesian model implies a prior on {9j,9*), denoted 
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by 



For the full Bayesian model, the posterior distribution of 9j is 




■^f,^^,\e,,9*)'K*{e,,e*)de*. 



Because f{'x.j''^\-K^j,6j,9j) = /(x^^^'lx-j, 6*^), the above posterior distribution can be further re- 
duced to 




If we write 




then the posterior distribution of 9j under the joint Bayesian model is 



p(%|x- 



x_,)oc/(xf |x_,,^,)7r,-x.,(%). 



Compared with the posterior distribution of the iterative imputation procedure, which is propor- 
tional to 



the difference lies in the prior distributions, TTj{9j) and vrj^x_j (6*^ ) • 

Controlling the distance between the posterior predictive distributions. We put forward 
tools to control the distance between the two posterior predictive distributions in ([s]) and Q. Let 
X be the generic notation the observed data, and let /x(^) and 5x(^) be two posterior densities of 
9. Let h{x\9) be the density function for future observations given the parameter 9, and let /x(^) 
and g-x.ix) be the posterior predictive distributions: 



The next proposition provides sufficient conditions that ||/x — ^xlli vanishes. 

Proposition 1 Let n he the sample size. Let f:x_{9) and g:x_{9) be two posterior density functions 
that share the same likelihood but have two different prior distributions ttj and TTg. Let 



p,{9j\^f,^^j) oc 5,(xf |x_,-,0,)7r,(0,) = /(xf |x_,-,0,)7r,(%) 




It is straightforward to obtain that 



ll/x -5x||i < ll/x -9x||i- 



(5) 



m 



r{9) 



9^(0) _ L{9) 



f^{9) jL{9)f^{9)d9 
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and n denote sample size. Let dL{6) he the partial derivative with respect to 9 and let ^ he a random 
variable such that 

L{e) = L{iJe)+dL{0-{e-f,e), 

where " •" denotes inner product and fj,g = f 0f:x_{9)d9. If there exists a random variable Z{9) with 
finite variance under fx, such that 

\V^dm ■ iO-l^e)\ < \dL{f,g)\Zi9), (6) 
then there exists a constant k. > such that for n sufficiently large 

||/x-5x||i< Yu • (7) 

We prove this proposition in Appendix \K\ 

Remark 1 We adapt Proposition^to the analysis of the conditional models. Expresion ^ implies 
that the posterior standard deviation of 9 is 0(n~^/^). For most parametric models, ^ is satisfied 
as long as the observed Fisher information is bounded from below by en for some e > 0. In 
particular, we let ^(x) be the complete-data MLE and An = {x : |^(x)| < 7}. Then, ^ is satisfied 
on the set An for any fixed 7. 

Remark 2 In order to verify that dlog L{9) is bounded, one only needs to know ttj and TTg up 
to a normalizing constant. This is because the bound is in terms of dL{9) / L{9). This helps to 
handle the situation when improper priors are used and it is not feasible to obtain a normalized 
prior distribution. In the current context, the prior likelihood ratio is L(9j) = iTj{9j)/Trj^:x__.{9j). 

Remark 3 If L{9) is twice differentiable, the convergence rate in ^ can be improved to 0(n~^/^). 
However, 0(n^^/^) is sufficient for the current analysis. 

3.3 Convergence of the invariant distributions 

With Proposition [T] and Remark [T| we have estabhshed that the transition kernels of the Gibbs 
chain and the iterative chain are close to each other in a large region An. The subsequent analysis 
falls into several steps. First, we slightly modify the processes by conditioning them on the set 
An with stationary distributions 9^°'"' (details provided below). The stationary distributions of 
the conditional processes and the original processes (t'j^"'* and are close in total variation. 

Second, we show (in Lemma [2| that, with a bound on the convergence rate, u^°*"' and v^°'"' are 
close in total variation and so it is with uf-°^' and u^"^" . The bound of convergence rate can be 
established by Proposition [2j 
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To proceed, we consider the chains conditional on the set An where the two transition kernels 
are uniformly close to each other. In particular, for each set B, we let 

~ Ki{w,BnAn) 

That is, we create another two processes, for which we update the missing data conditional on 
X G An- The next lemma shows that we only need to consider the chains conditional on the set 

An- 

Lemma 1 Suppose that both Ki and K2 are positive Harris recurrent. We can choose An as in 
the form of Remark^ and 7 sufficiently large so that 

uf'iAn)^! (9) 

in probability as n —t- 00. Let x™'''*(A;) be the Markov chains following Ki, defined as in Q, with 
invariant distribution i>^°''" . Then, 

drviiyf )^0, (10) 

as n —>• 00 . 

The proof is elementary by the representation of u^"^" through the renewal theory and therefore 
is omitted. Based on the above lemma, we only need to show that dTv{^^°^'' ^^'2°^^) ~^ 0- The 
expression •) — K2{w, •)||i approaches uniformly for w G An- This implies that 

\\Ki{w,-),K2{w,-)\\i^0 

as n — )• 00 uniformly for w G An- With the above convergence, we use the following lemma to 
establish the convergence between P^"'* and v^°'"' . 

Lemma 2 Let x™'^'*(A;) admit data- dependent transition kernels Ki for i = 1,2. We use n to 
denote sample size- Suppose that each Ki admits a data- dependent unique invariant distribution, 
denoted by 9^°''" , and that the following two conditions hold: 

1- The convergence of the two transition kernels 

diAn) = sup \\Ki{w, •) - K2{W, ■)\\v ^ 0, (11) 

w£A„ 

in probability as n — )• 00. The function V is either a geometric drift function for K2 or a 
constant, i-C-, V = 1- 

2- Furthermore, there exists a monotone decreasing sequence — >• (independent of data) and 
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a starting measure v (depending on data) such that 



kf'\u,.)-Dr\.)\\y<r,,^k>Q 



(12) 



as n —>• oo. 



Then, 



in probability as n ^ oo. 



I ,~,x° 



,~.x° 



(13) 



Remark 4 The above lemma holds if V = 1 or V is a drift function. For the analysis of conver- 
gence in total variation, we only need that V = 1. The results when V is a drift function is prepared 
for the analysis of incompatible models. 

The first condition in the above lemma has been obtained by the result of Proposition [T} 



Condition (12) is more difficult to establish. According to the standard results in [18] (see also 



(35) in the appendix), one set of sufficient conditions for (12) is that the chains Ki and K2 admit 
a common small set, C; in addition, each of them admits their own drift functions associated with 
the small set C (c.f. Appendix [C|) . 

Gibbs chains typically admit a small set C and a drift function V, that is, for some positive 
measure /x 



kiiw,A) > qimiA), 
for w £ C, qi £ (0, 1); for some Ai G (0, 1) and for all w ^ C 

XiV{w)> [ V{w')ki{w,dw'). 



(14) 



(15) 

With the existence of C and V a bound of convergence (with starting point w £ C) can be 
established for the Gibbs chain by standard results (see, for instance, [18j), and r^. only depends on 
Ai and qi. Therefore, it is necessary to require that Ai and qi are independent of x°^*. In contrast, 
the small set C and drift function V could be data-dependent. 

Given that Ki and K2 are close in "|| • the set C is also a small set for K2, that is 
K2{w,A) > 52^2(^)5 for some q2 £ (0,1), all w £ C, and all measurable set A. The following 
proposition, whose proof is given in the appendix, establishes the conditions under which V is also 
a drift function for K2. 

Proposition 2 Assume the following conditions hold. 



1. The transition kernel Ki admits a small set C and a drift function V satisfying (15). 
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2. Let Lj(9j) = TTj{9j)/TTj^-x__j{0j) (j = 1, ...,p) be the the ratio of prior distributions for each con- 
ditional model (possibly depending on the data) so that on the set An sup|g_^.|^^ dLj{6j)/Lj{9j) < 
oo. 

3. For each j and I < k < p — j , there exists a Zj{9j) serving as the bound in (|6| for each Lj. 
In addition, Zj satisfies the following moment condition 

El [ Z]^,{9,+i)V\w,+k) \w,]= oin)V\wj), (16) 

where Ei is the expectation associated with the updating distribution of Ki and wj is the state 
of the chain when the j-th variable is just updated. The convergence o{n)/n — )• is uniform 
in Wj G An. 

Then, there exits A2 G (0, 1) such that as n tends to infinity with probability converging to one the 
following inequality holds 

\2V{w)> ! V{w')K2{w,dw'). (17) 



Remark 5 The intuition of the above proposition is as follows. V satisfying inequality (15) is 
a drift function of Ki to C . Since the Ki and K2 are close to each other, we may expect that 
J V{w')Ki{w,dw') ~ j V{w')K2{w,dw'). The above proposition basically states the conditions 
under which this approximation is indeed true and suggests that V be a drift function of K2 if it is 



a drift function of Ki. Condition (16) is imposed for a technical purpose. In particular, we allow 
the expectation of Z'j_^i{9jj^i)V'^{wjj^k) to grow to infinity but at a slower rate than n. Therefore, 
it is a mild condition. 

We now summarize the analysis and the results of the compatible conditional models in the 
following theorem. 

Theorem 1 Suppose that a set of conditional models {gj{xj\x-^j,9j) : 9j £ Qj^j = is 
compatible with a joint model {f{x\9) : 9 £ Q}. The Gibbs chain and the iterative chain then admit 
transition kernels Ki and unique stationary distributions vf"^" . Suppose the following conditions 
are satisfied: 

Al Let An = {'X- : |^(x)| < 7}. One can choose 7 sufficiently large so that 1/^°'"' [An) — ?■ 0, in 
probability as n —)• 00. 

A2 The conditions in Proposition^ hold. 

Then, 



rohs -v-ohs , 



7 / X X 

in probability as n —)■ 00. 
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Remark 6 One sufficient condition for Al is that the stationary distributions of6{'x) under v^"^" 
converge to a value 6^, where 9^ and 6'^ are not necessarily the same. 

Remark 7 In addition to the conditions of Proposition [7J Proposition also requires that one 
constructs a drift function towards a small set for the Gihhs chain. One can usually construct qi 
and Al free of data if the proportion of missing data is bounded from the above by 1 — e. The most 
difficult task usually lies in constructing a drift function. For illustration purpose, we construct a 
drift function (in the supplement material) for the linear example in Section^ 

Proof of Theorem [l| We summarize the analysis of compatible models in this proof. If gj^s 
are compatible with /, then the conditional posterior predictive distributions of the Gibbs chain 
and the iterative chain are given in ([s]) and Q. Thanks to compatibility, the "|| • ||i" distance 
between the posterior predictive distributions are bounded by the distance between the posterior 
distributions of their own parameters as in ([5|. 

On the set A„, the Fisher information of the likelihood has a lower bound of en for some £. 
Then, by Proposition [T] and the second condition in Proposition [2| the distance between the two 
posterior distributions is of order 0(n~^/^) uniformly on set An. Similar convergence result holds 
for the conditional transition kernels, that is, •) — K2{w, •)||i — )• 0. Thus, the first condition 

in Lemma [2] has been satisfied. 

To verify the conditions of Proposition [2| one needs to construct a small set C such that ( 14 ) 



holds for both chains and a drift function V for one of the two chains such that (15) holds. Based 
on the results of Proposition ^ Ki and K2 share a common data-dependent small set C with q^ 
independent of data and a drift function V (possibly with different rate Ai and A2). 

According to the standard bound of Markov chain rate of convergence (for instance, [iHj and 



(35) in the appendix), there exists a common starting value w ^ C and a bound such that the 
bound ( 12 ) in Lemma [2] is satisfied. Thus both conditions in Lemma [2] have been satisfied and 
further 

, , v-obs v-obs , 

in probability as n — >• 00. According to condition Al and Lemma [T| the above convergence implies 
that 

, -v-obs v-obs , 

Thereby, we conclude the analysis. ■ 

3.4 On the necessity of model compatibility 

Theorem [T] shows that for compatible models and under suitable technical conditions, iterative 
imputation is asymptotically equivalent to Bayesian imputation. The following theorem suggests 
that model compatibility is typically necessary for this convergence. 
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Let P-^ denote the probability measure induced by the posterior predictive distribution of the 
joint Bayesian model and P? denote those induced by the iterative imputation's conditional models. 
That is, 

J A 
J A 

Furthermore, denote the stationary distributions of the Gibbs chain and the iterative chain by u^"'" 

, vobs 

and 

Theorem 2 Suppose that for some j £ Z+, sets A and C , and e G (0, 1/2) 

inf PH-^T' e A\ii^YX^') > sup Pf{xf'' G ^|x™^x°''^) + e 

or 

sup Pf (x™^ G A|x!!^f, x°*^) < inf P-^(x™" G Ajx'^f , x"''^) - e 
x^'^ec x^j^ec 

and i/f°'"(x™^ G C) > g G (0, 1). 

Then there exists a set B such that 

Proof. Suppose that 

inf Pf (x™^ G ^|x!"f , x°^^) > sup P-^(x™^ G A|x™f , x°''^) + e, 

T/ie "less than" case is completely analogous. Consider the set B = {x™'^ : x™* G C, x™'^ G A}. 
If 

\vf (x™^ G C) - z^f "'^ (x™^ e C)| < qe/2, (18) 

i/ien, &2/ t/ie fact that 
we obtain 

\^f'^{^^rms ^ ^) _ ^x-^^^x^^s G P)| > (?e/4. 



otherwise, if {18) does not hold, let B = {x™^ : x™* G C}. 



For two models with different likelihood functions, one can construct sets A and C such that the 
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conditions in the above theorem hold. Therefore, if among the predictive distributions of all the 
p conditional models there is one gj that is different from / as stated in Theorem [2| then the 
stationary distribution of the iterative imputation is different from the posterior distribution of the 
Bayesian model in total variation by a fixed amount. For a set of incompatible models and any 
joint model /, there exists at least one j such that the conditional likelihood functions of given 
x_j are different for / and gj. Their predictive distributions have to be different for Xj. Therefore, 
such an iterative imputation using incompatible conditional models typically does not correspond 
to Bayesian imputation under any joint model. 

4 Incompatible conditional models 

In this section, we proceed to the discussion of incompatible conditional models. We first extend 
the concept of model compatibility to semi-compatibility which includes the regression models we 
have generally seen in practical uses of iterative imputation. We then introduce the validity of semi- 
compatible models. Finally, we show that if the conditional models are semi-compatible and valid 
(together with a few mild technical conditions) the combined imputation estimator is consistent. 

4.1 Semi-compatibility and model validity 

As in the previous section, we assume that the invariant distribution exists. For compatible con- 
ditional models, we used the posterior distribution of the corresponding Bayesian model as the 
natural benchmark and show that the two imputation distributions converge to each other. We 
can use this idea for the analysis of incompatible models. In this setting, the first issue is to find 
a natural Bayesian model associated with a set of incompatible conditional models. Naturally, we 
introduce the concept of semi-compatibility. 

Definition 2 A set of conditional models {hj{xj\x-j,9j,(pj) : j = 1, ■■.,p}, each of which is indexed 
by two sets of parameters {9j,ipj), is said to be semi-compatible, if there exists a set of compatible 
conditional models 



for j = 1, ...,p. We call {gj : j = 1, ...,p} a compatible element of {hj : j = 1, ...,p} . 

By definition, every set of compatible conditional models is semi-compatible. A simple and un- 
interesting class of semi-compatible models arises with iterative regression imputation. As typically 
parameterized, these models include complete independence as a special case. A trivial compatible 
element, then, is the one in which xj is independent of x-j under gj for all j. Throughout the discus- 
sion of this section, we use {gj : j = 1, ...,p} to denote the compatible element of {hj : j = 1, ...,p} 
and / to denote the joint model compatible with {gj : j = 1, ...,p}. 




(19) 
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Semi-compatibility is a natural concept connecting a joint probability model to a set of condi- 
tionals. One foundation of almost all statistical theories is that data are generated according to 
some (unknown) probability law. When setting up each conditional model, the imputer chooses a 
rich family that is intended to include distributions that are close to the true conditional distribu- 
tion. For instance, as recommended by |13j . the imputer should try to include as many predictors 
as possible (using regularization as necessary to keep the estimates stable). Sometimes, the degrees 
of flexibility among the conditional models are different. For instance, some includes quadratic 
terms or interactions. This richness usually results in incompatibility. Semi-compatibility includes 
such cases in which the conditional models are rich enough to include the true model but may not 
be always compatible among themselves. To proceed, we introduce the following definition. 

Definition 3 Let {hj : j = 1, ...,p} be semi-compatible, {gj : j = 1, ...,p} be its compatible element, 
and f be the joint model compatible with gj. If the joint model f{x\9) includes the true probability 
distribution, we say {hj : j = 1, ...,p} is a set of valid semi-compatible models. 

In order to obtain good prediction, we need the validity of the semi-compatible models. A 
natural issue is the performance of valid semi-compatible models. Given that we have given up 
compatibility, we should not expect the iterative imputation to be equivalent to any joint Bayesian 
imputation. Nevertheless, under mild conditions, we are able to show the consistency of the com- 
bined imputation estimator. 

4.2 Main theorem of incompatible conditional models 

Now, we list a set of conditions: 

Bl Both the Gibbs and iterative chains admit their unique invariant distributions, 1/^°'"' and 
^2 

B2 The posterior distributions of 6 (based on /) and {6j,ipj) (based on hj) given a complete 
data set x have the representation |^ — 6'| < (,n~^^'^, \{0j — Oj,ipj — ipj)\ < ^jn~^/^, where 9 
is the maximum likelihood estimate of /(x|^), {6j, (pj) is the maximum likelihood estimate of 
/ij(xj|xj, (/7j), and Se'^^l < k, E'e'^' < k for some k > 0. 

B3 All the score functions have finite moment generating functions under /(x™**|x°^*, 0). 

B4 For each variable j, there exists a subset of observations Lj so that for each i G tj Xij is 
missing and Xi-j is fully observed. In addition, the cardinality #(ij) — )• 00 as n — >• 00. 

Remark 8 Conditions B2 and B3 impose moment conditions on the posterior distribution and the 
score functions. They are satisfied by most parametric families. Condition B4 rules out certain 
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boundary cases of missingness patterns and is imposed for technical purposes. The condition it is 
not very restrictive because it only requires that the cardinality of lj tends to infinity, not necessarily 
even of order of 0{n). 

We now express the fifth and final condition which requires the following construction. Assume 
the conditional models are valid and that the data x is generated from f(x.\6^). We use 0^ = tj{9^) 
and 99^ = to denote the true parameters under hj. We define 

e = snpfi^"''\e), 9j = t,{§), (20) 
e 

be the observed-data MLE and 

= argsup / log/(x|0)i.f "(dx™-), (0^,0^) = arg sup / logh,{^j\^.„e„^,)uf'\d^^n 
9 J Sjt'Pj J 

(21) 

where x = (x°*^x™^). 

Consider a Markov chain x*{k) corresponding to one observation — one row of the data matrix — 
living on BP . The chain evolves as follows. Within each iteration, each dimension j is updated 
conditional on the others according to the conditional distribution 

hj {xj I x_ j , Oj , ) , 

where {6j, (pj) = {Oj, 0) +£S,j and is a random vector with finite MGF (independent of everything 
at every step). Alternatively, one may consider {9j, ipj) as a sample from the posterior distribution 
corresponding to the conditional model hj. Thus, x*{k) is the marginal chain of one observation 
in iterative chain. Given that x™''''^(A;) admits a unique invariant distribution, x*{k) admits its 
unique stationary distribution for e sufficiently small. Furthermore, consider that x{k) is a Gibbs 
sampler and it admits stationary distribution f{x\9), that is, each component is updated according 
to the conditional distribution /(xj|x_j, 9) and the parameters of the updating distribution are set 
at the observed data maximum likelihood estimate, 9. The last condition is stated as follows. 

B5 x*{k) and x{k) satisfy conditions in Lemma [2] as e — t- 0, that is, the invariant distributions of 
x*{k) and x{k) converges in || • \\v norm, where y is a drift function for x*{k). There exists 
a constant k such that all the score functions are bounded by 

dlog f{x\9°) < kV{x), dloghj{xj\x_j,tj{9°),ipj = 0) < kV{x). 

Remark 9 By choosing e small, the transition kernels of x*{k) and x{k) converge to each other. 
Condition B5 requires that Lemma applies in this setting, that their invariant distributions are 
close in the sense stated in the Lemma. This condition does not suggest that Lemma applies to 
1/^°^" and which represents the joint distribution of many such x*{k) 's and x{k) 's. 
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We can now state the main theorem in this section. 



Theorem 3 Consider a set of valid semi- compatible models {hj : j = 1, ■■■,p}, and assume condi- 



tions Bl-5 are in force. Then, following the notations in (21), the following limits hold: 



em^e^ ef ^t,{e\ ^fUo, (22) 

in probability as sample size n — )• oo for all j . 

Remark 10 The expression 9^'^^ corresponds to the following estimator. Impute the missing data 
from distribution ly^"*"" m times to obtain m complete datasets. Stack the m datasets to one big 

'(2) "(2) 

dataset. Let 6m be the maximum likelihood estimator based on the big dataset. Then, Om converges 
to 6^"^^ as m — )■ cxD. Furthermore, 6^"^^ is asymptotically equivalent to the combined point estimator of 

^(2\ ^('2') 

6 according to Rubin's combining rule (with infinitely many imputations). Similarly, {9- ,ipj ) is 
asymptotically equivalent to the combined estimator of the conditional model. Therefore, Theorem 
suggests that the combined imputation estimators are consistent under conditions Bl-5. 

5 Linear example 

5.1 A simple set of compatible conditional models 

In this subsection, we study a linear model as an illustration. Consider n i.i.d. bivariate observations 
(x,y) = {{xi,yi) : i = 1, ...,n} and a set of conditional models 

Xi\yi N{f3^\yyi,T^), yi\xi N{l3y\^Xi,Ty). (23) 

To simplify the discussion, we set the intercepts to zero. As discussed previously, the joint compat- 
ible model assumes that (x, y) is a bivariate normal random variable with mean zero, variances cr^ 



and ay, and correlation p. The reparameterization from the joint model to the conditional model 
of y given x is 

I3y\x = —P, = (1 

Figure [l] displays the missingness pattern we are assuming for this simple example, with a denoting 
the set of observations for which both x and y are observed, b denote those with missing y's, 
and c denoting those with missing x's; n^, Ub, and Uc denote their respective sample sizes, and 
n = Ua -\- Uf, -\- Uc. To keep the example simple, we assume that there are no cases for which both 
X and y are missing. 

Positive recurrence and limiting distributions. The Gibbs chain and the iterative chain 
admit a common small set containing the observed-data maximum likelihood estimate. The 
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X Y 



Figure 1: Missingness pattern for our simple example with two variables. Gray and white areas 
indicate observed and missing data, respectively. This example is constructed so that there are no 
cases for which both variables are missing. 

construction of the drift functions is tedious and is not particularly relevant to the current dis- 
cussion, and so we leave their detailed derivations to the supplemental materials available at 
http: / / stat.columbia.edu/'^jcliu/paper/driftsupp.pdfl. We proceed here by assuming that they are 
in force. 

Total variation distance between the kernels. The results for incompatible models apply 
here. Thus, condition Al in Theorem [T] has been satisfied. We now check the boundedness of 
dL{9). The posterior distribution of the full Bayes model is 

p(o-x>Ty>/3j/|x|x,y) oc /(x,y|a^,T^,/3j,|3.)7r*((T^,r^,/3y|a.) 

= /(yk^, x)/(x|(T^)7r*(cr^, r^, 

The posterior distribution of {Ty,liy\x) with a"^ integrated out is 

P(Ty,/3y|x|x,y) oc /(y|ry,/3j^|^,x)7rx(/33/|x,Ty), 

where 

T^^{Py\x,Tl) cc j f{x\al)ir*{al,T^,l3yi^)dal. 

The next task is to show that 7rx(/3j^|x) Ty) is a diffuse prior satisfying the conditions in Proposition 
[ij We impose the following independent prior distributions on cr^, ay, and p: 

The distribution of x does not depend on [ay, p). Therefore, under the posterior distribution given 
x, o"^ and (cr^,p) are independent. Conditional on x, o"^ is inverse-gamma. Now we proceed to 
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develop the conditional/posterior distribution of {Ty,(3y\x) given x. Consider the following change 
of variables 

^2 _ 2 I o2 2 n — R / '^x 

"y ~ ^ Py\x"x^ P - Py\x\ o , ^2 ^2 ' 



Then, 



Together with 



we have 



det 



'y ^ l^y\x- 



7r{ay,p ) oc ay, 



7rx(ry,/3y|^) oc /det 7r(cjy, p)p((T3.|x)da^ 

= j (^xP{o''^\'^)d(j1 = C(x). 



Remark 11 // one chooses T^2{Ty-: Py\x) ^ 1 /o'^ the iterative imputation and (24) for the joint 
Bayesian model, the iterative chain and the Gibbs chain happen to have identical transition kernels 
and, therefore, identical invariant distributions. This is one of the rare occasions that these two 
procedures yield identical imputation distributions. 

If one chooses Jeffreys' prior, vr2(r^, /3^|^) oc r~^, then 

and dL is bounded in a suitably chosen compact set containing the true parameters. Thus, Theorem 
[T] applies. 

Empirical convergence check. To numerically confirm the convergence of the two distributions, 
we generate the following data sets. To simplify analysis, let (xj, yj)'s be bivariate Gaussian random 
vectors with mean zero, variance one, and correlation zero. We set Ua = 200, Uf^ = 80, and Uc = 80. 
For the iterative imputation we use Jeffreys' prior p{Ty, /3y\x) oc and p(t^, Px\y) ^x"^- For the 
full Bayesian model, the prior distribution is chosen as in ( |24[ ). 

We monitor the posterior distributions of the following statistics: 

Fi gures I2I shows the quantile-quantile plots of the distributions of (3x and f3y under and 
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Figure 2: Quantile-quantile plots demonstrating the closeness of the posterior distribution of the 
Bayesian model and the compatible iterative imputation distributions for and f3y with sample 
size Ua = 200. 

based on 1 million MCMC iterations. The differences between these two distributions are tiny. 

5.2 Higher-dimensional linear models 

We next consider a more complicated and realistic situation, in which there are p continuous 
variables, xi, ...,Xp. Each conditional model is linear in the sense that, for each j, 

xj\x-j N{{l,xlj)(3j,a]), 

which is the set of compatible models presented in Example [2j 

In the simulation, we generate 1000 samples of {xi, ...,xi) from a 7-dimensional multivariate 
normal distribution with mean and covariance matrix that equals 1 on the diagonals and 0.4 on 
the off-diagonal elements. We then generate another variable y ~ N(—2 + xi + X2 + x^ + x^ — x^ — 
xq — xy, 1). Hence the dataset contains y, xi,X2, ■ ■ ■ , xj. For each variable, we randomly select 30% 
of the observations and set them to be missing. Thus, the missing pattern of the dataset is missing 
completely at random (MCAR). We impute the missing values in two ways: iterative imputation 
and a multivariate Gaussian joint Bayesian model. After imputation, we use the imputed datasets 
and regress y on all x's to obtain the regression coefficients. The quantile-quantile plots in Figure 
[3] compare the imputation distribution of the least-square estimates of the regression coefficients of 
the iterative imputation procedure and the multivariate Gaussian joint model. 

5.3 Simulation study for incompatible models 

We next consider conditional models that are incompatible and valid. To study the frequency 
properties of the iterative imputation algorithm, we generate 1000 datasets independently each 
with a sample size of 10,000. For each dataset, yi ~ Bernouli(0.45), 1/2 ~ Bernouli(0.65), yi 
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Figure 3: Quantile-quantile plots of the imputation distributions of the regression coefficients {y 
on x's) from the joint Bayesian imputation and the iterative imputation. 

and y2 are independent, and the remaining variables come from this conditional distribution: 
xi, . . . ,X5\yi,y2 ~ N{^iyi + fj,2y2,'^), where //i is a vector of I's and fi2 is a vector of 0.5's and S 
is a 5 X 5 matrix that is 1 on the diagonals and 0.2 on the off-diagonal elements. 

As before, we remove 30% of the data completely at random and then impute the dataset using 
iterative imputation. We impute yi and 7/2 using logistic regressions and xi,...,X5 using linear 
regressions. In particular, yi is conditionally imputed given y2,xi, X2, X3,X4, x^, and the interactions 
xiy2 and 2:2^2; 2/2 is conditionally imputed given yi,xi, X2, x^, x^, X5, and the interactions xiyi and 
X2yi; and each xj, j = 1, . . . ,5, is conditionally imputed given yi, y2, and the other four Xj's. The 
conditional models for the Xj's are simple linear models, whereas the logistic regressions for yi 
also include interactions. As a result, the set of conditional models is no longer compatible but 
is still valid. To check whether or not the incompatible models result in reasonable estimates, we 
impute the missing values using these conditional models. For each dataset, we obtain combined 
estimates of the regression coefficients of xi given the others by averaging the least-square estimates 
over 50 imputed datasets. That is, for each dataset, we have 50 imputations, for each of which 
we obtain the estimated regression coefficients of xi\yi,y2, X2, X3, X4, X5. Next, we average over 50 
sets of coefficients to obtain a single set of coefficients. We repeat the whole procedure on 1000 
datasets to get 1000 sets of estimated coefficients. Figure |4] shows the distribution of the estimated 
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Figure 4: Distributions of coefficients of xi regressing on yi,y2,X2,X3,X4,X5 from 1000 im- 
puted datasets using an iterative imputation routine |27|- The dashed vertical hues rep- 
resents the true value of the regression coefficients of the simulation setting, which are 
0.5, 0.25, 0.125, 0.125, 0.125, 0.125. 



coefficients of xi regressing on yi,y2, X2, xs, X4, x^ based on the 1000 independent datasets. The 
frequentist distributions of the combined estimate are centered around their true values indicated 
by the dashed line. This is consistent with Theorem [Sj 

6 Discussion 

Iterative imputation is appealing in that it promises to solve the difficult task of multivariate 
modeling and imputation using the flexible and simple tools of regression modeling. But two key 
concerns arise: does the algorithm converge to a stationary distribution and, if it converges, how 
to interpret the resulting joint distribution of the imputations, given that in general it will not 
correspond exactly to the fitted conditionals from the regression. In this article, we have taken 
steps in that direction. 

There are several natural directions for future research. From one direction, it should be possible 
to obtain exact results for some particular classes of models such as linear regressions with Gaussian 
errors and Gaussian prior distributions, in which case convergence can be expressed in terms of 
simple matrix operations. In the more general case of arbitrary families of regression models, it 
would be desirable to develop diagnostics for stationarity (along with proofs of the effectiveness of 
such diagnostics, under some conditions) and empirical measures of the magnitude of discrepancies 
between fitted and stationary conditional distributions. 
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Another open problem here is how to consistently estimate the variance of the combined impu- 
tation estimator. Given that the imputation distribution of incompatible models is asymptotically 
different from that of any joint Bayesian imputation, there is no guarantee that Rubin's combined 
variance estimator is asymptotically consistent. We acknowledge that this is a challenging problem. 
Even for joint Bayesian imputation, estimating the variance of the combined estimator is still a 
nontrivial task under specific situations; see, for instance, [91II3]. Therefore, we leave this issue to 
future studies. 

We conclude with some brief notes. 

A special case of the compatible models. In the analysis of the conditional models, suppose 
that the parameter spaces of the conditional distribution and the covariates are separable, that 
is, f{xj,x^j\9j,9*) = f{xj\x^j,9j)f{x^j\6*) and there exists a prior vr for the joint model / such 
that 9j and 0* are a priori independent for all j. The, the boundedness of dlog L{Oj) becomes 
straightforward to obtain. Note that L{6j) = vr(0j)/7rj^x_j (%) and 

vr„x_,(%)=vr*(0,) I f{^^,\,e*)7r*{e*)d9*. 

Thus, L{9j) = TT{0j)/TT*{6j) is independent of the data. 

A example when Theorem [l] does not apply. The expression dlog L{9j) in Proposition [l] is 
not always bounded. For example, suppose that vrj^x_j(%) is an informative prior for which the 
covariates of the regression model x_j provide strong information on the regression coefficients 9j 
according to the joint model /. For instance, in Example [3] on pagejTJ the marginal distribution of 
the covariate X2 provides ^/n amount of information on the coefficients of the logistic regression. 
Under this situation. Proposition [l] still holds, but the right-hand side of Q does not necessarily 
converge to zero. Consequently, Theorem [T] (the main result in this section, presented later) does not 
apply. For the properties of iterative imputation under such situations, we can apply the consistency 
result for incompatible models, as discussed in Section |4j That is, the combined estimator is still 
consistent though it is not equivalent to any Bayesian model. 

A Proofs in Section [3] 

Lemma 3 Let Qq and Qi be probability measures defined on the same a-field J- and such that 
dQi = r^^dQo for a positive r.v. r > 0. Suppose that for some e > 0, E^^ (r'^) = E^'^'r < 1 + e. 
Then, 

sup \E'^^{f{X))-E'^^^{f{X))\<e'/\ 
l/l<i 
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Proof of Lemma [3l 

- E'^^ifiX))] = \eQ^ [(1 - r)f{X)]\ 

< E<^- (|r - 1|) < [E'^-ir - iff'^ = {E'^^r^ - l)'^' < e'/\ 



Proof of Proposition [TJ From Lemma [Sj we need to show that 



j r\e)f^{e)dd < 1 + 



Let HL = EfL{e). 



r{e) 



^/nL{fig) 
L{9) Lipe) + dL{i){e - ^xe) 



Then 



Ef{r'^{e)) 



Line 



^ ^^^\dLU,e)\EZ^idL{f,e)rEZ^ 



With a similar argument, there exists a constant ki such that 

\dL{f,e)\ 



Mi 



Therefore, there exists some k > such that 



< 



V L{fi0)y/n L^{M n J fxl 

\dL{flg)\ K 



< 1 + 



L{fig) ^/n' 
Using Lemma [3j we conclude the proof. ■ 

Proof of Lemma [2| For any e,5 > 0, let fe^ = inf{j : V/c > j, < e}. Then, for any m> 



-^f \\v 



< 



^ m 



k=l 



+ 



k=l 



V 



+ 



^ m ^ m 



k=l 



k=l 



V 
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By the definition of k^, each of the first two terms is bounded by e + k^Cs/m, where 
max{ J V{w)K^''\i', dw) : i = 1, 2, 1 < k < A;^}. For the last term, for each k < m and |/| < V, 



< 
< 



{kP{u, dw)- k^^\v,dw)^ j fiw')k2iw,dw') 

kf\u,.)-kP{v,.) ^+d{An) 



+ 



k^^\u,dw)\\ki{w,-)-K2{w,- 



w 



oil), 



as re — 7- oo. The second inequality in the above display holds, if V = 1; if 1/ is a drift function 
of K2, we replace f hy V and the inequality hold by noticing that J V{w')K2{w, dw') < X2V{w). 
Then, by induction, for all k < m, 



<)(^.,•)-i^^(^,•) ^<o(l) 



Therefore, the last term is 



^ m ^ m 

lVi^f)(.,.)--E^f^( 



^,0 



k=l 



k=l 



oil), 



as n — > cc. Thus, for each e > 0, we first choose k,£ and C^, then choose m large such that 



2Ci;ki;/m < e, lastly choose re large such that k'^\v, ■) — K'^' iv, 

< 5e. 



< e. Therefore, 



-.x° 



,--.x° 



Proof of Proposition [2[ The proof uses a similar idea as that of Proposition [l] ki is equivalent 
to updating the missing values from the posterior predictive distribution of / condition on that 
X G An- Similarly, K2 corresponds to the posterior predictive distributions of gj^s. By Proposition 
[l| for all X G An, \Kiix, B) - K2(x, B)\ = ©(re-^/"^), which implies that 

|4^ = l + 0(re-V4). 

K2iw,An) 

The posterior distribution is a joint distribution of the parameter and the missing values. Therefore, 
6j is part of the vector w. Let 



R 



k2iw, dw' 



Kiiw,Ar, 



kiiw,dw') -y ^^K2iw,A 
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where rj^Oj) is the normahzed prior ratio corresponding to the imputation model of the j-th variable, 
whose definition is given in Proposition [l] 
For the verification of the drift function, 



V{'w')K2{'w, dw') 



R X V{w')ki{w,dw') 



< (1 + 0{n-^/^))P I V{w') n (l + 2 



^^^^^'^^^ ^ ki{w,dw'). (26) 



L{fie,)Vri 



Let Wj be the state of the chain when the j-th variable is just updated (then, w' = Wp). Then, 
according to the condition in (116^ , we have that for each j + k < p 



El 



V{wj+k)(l + 2 



dL{^e,+^)Zj+k 
L(/i, 



'j+k. 



Wi 



El iViw,+k)\wj)+oil)V{wj) 



Since the o(l) is uniform in Wj E An, we can apply induction on the product in (26) by conditioning 
on J^j = a{wi, ...,Wj) sequentially for j = 1, ...,n. Therefore, we have 



V{w')K2{w,dw') = {l + o{l)) j V{w')Ki{w,dw') + o{l)V{w) 
< {Xi + o{l))V{w). 

Then, we can find another A2 G (0,1) such that the above display is bounded by X2V{w). Thus, 
V{w) is also a drift function for K2- ■ 

B Proof of Theorem [3] 

Throughout this proof we use the following notation for asymptotic behavior. We say that < 
g{n) = 0{h{n)) if g{n) < ch{n) for some constant c G (0, oo) and all n > 1. We also write 
g{n) = o{h{n)) as n oo if g{n)/h{n) — t- as n — )■ oo. Finally, we write Xn = Op{g{n)) if 
\Xn/g{n)\ is stochastically dominated by some distribution with finite exponential moment. 

Let x(A;) be the iterative chain starting from its stationary distribution u^"^" . Furthermore, let 
be the distribution of x(/i;) when the j-th variable is just updated. Due to incompatibility, 
z/^°'"''s are not necessarily identical. Thanks to stationarity, v^"^" does not depend on k and 



,x°° 

^2,0 



u^p . Let 



(6lj,(^j) = arg sup / log /ij(xj|x_j, 61^, (/7j>^^_i(dx™''). 

The proof consists of two steps. Step 1, we show that for all j, tpj — )• 0, 9j — 6j — )• 0, as n — )• oo, 
where 6j is the observed-data maximum likelihood estimate based on the joint model / (defined 



as in (20)). That is, each variable is updated approximately from the conditional distribution 
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f{xj\x-j, 9). Step 2, we establish the statement of the theorem. 

Step 1. We prove this step by contradiction. Suppose that there exist Eq and jo such that 
\'f>io\ > ^0 or \9jf^ — ^jol > £0- Let x*(A;) be the Gibbs chain whose stationary distribution {lyf"^") 
is the posterior predictive distribution associated with the joint model / (c.f. Definition [3]) . In 
addition, x* starts from its stationary distribution vf"^" . We now consider the KL divergence 

Let x{k,j) and x.*(k,j) be the state at iteration k + 1 and the j-th variable is just updated. Since 
both chains are stationary, the distributions of x(fc, j) and x*(k,j) are free of k. To simplify 
notation, we let 

u = xf-(0,i-l), ^; = x_,(0,i-l) = x_,(0,i), u; = x--(0,j), 
u* = x*™-(0,j-l), ^;*=xl^.(0,j-l)=xl^.(0,j), t/;*=x*™-(0,j). 

That is, u is the missing value of variable j from the previous step and w is the updated missing 
value of variable j. v stands for the variables that do not change in this update. Let Pj{-) be a 
generic notation for density functions of {u,v,w) and Pj(-) for {u* ,v* ,w*). By the chain rule, we 
have that 

p*{u, V, w) 

log — -p* {u, V, w)dudvdw 

Pj{u,v,w) 

p*{u,v) f p*Aw\u,v) 

log -p*{u,v)dudv+ / log — I -p*{u,v,w)dudvdw 

Pj{u,v) J pj{w\u,v) 

p*{v,w) f p*{u\v,'w) 

log -p*{v,w)dvdw+ / log -p*(u,v,w)dudvdw. (27) 

Pj{v,w) J pj{u\v,w) 

log^ '-p*{u,v)dudv = D{uf 28 

Pj{u,v) 

log ^^4^pK^> w)dvdw = D{vf\\uf-'). (29) 
Pj[V,W) ■' '■' 

Furthermore, p*{w\u, v) is the posterior predictive distribution according to / and pj{w\u, v) is the 

posterior predictive according to hj. Note that / is a sub-family of hj for the prediction of variable 

j. Under p*{u,v,w) that is the posterior distribution of /, we have 

log — — — = Op{l), log — — = Op{l). 

^/(x--|x_„^) ^/(x--|x_„0) ' 



By construction. 



and 
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To understand the above estimate, for the posterior predictive distribution of the Gibbs chain, one 
first draw 6 from the posterior distribution which is ^ + Op{n^^^^) and then draw each Xj from 
f{xj\x-j,6). Thus, we may consider that p*{'w\u,v) w /(x™*|x-_j, 6* + ^/y/n). Together with the 
fact that f?log/(x™**|x_j, 9) = Op{n^^^'^) we obtain the above approximation. The same argument 
apphes to the second estimate for pj{w\u, v) too. With these two estimates, we have 



p*(w\u, v) 
log ^^ =Op(l). 

Pj{W\U, V) 



According to condition B3 that all the score functions has exponential moments, then we have 



/ 



p*Jw\u, v) 

log — I rp*(u,v,w)dudvdw = 0(1). (30) 

Pj[w\u,v) ■' 



We insert (28), (29), and (30) back to (27). For all 1 < j < p, we have that 



^(^r"lk2''ri) = ^(^r"lk2T) + 0(l) + / log^^^^^p*{u,v,w)dudvdw. 



' pj{u\v, w) 



We denote the last piece by 



p*{u\v,w) ^ , , , 

Aj = j log — — :^Pj[u,v,w)dudvdw. 



Pj\ 



U V, w 



Note that i/^"*"" = i^^q'"' = ^^p" ■ Then, we have that 



obs I I -j^ohs ^ / -^ohs I I -j^ofas 

'2,0 

ohs . . vofes 



V 

„ , -yrohs , , v-obs . V — ^ , „ / ^ 

= Di'^i Hp ) + Y,Ap + Oil) 

3=1 

i=i 

Thus, X]j=i Ap = 0{1). Note that each Aj is non-negative. Thus, Aj must be bounded for all j, 
that is 

Aj = 0{l). (31) 

In what follows, we establish contradiction by showing that Aj^ — t- oo if {ifjol + {Oj^ — > Eq. 
Now, we change all the j's to jo, that is, 

u = x™^(0,jo-l), t; = x_j„(0,jo-l) =x_,„(0,jo), t« = x™^(0,jo), 
u* = x;™^(0,jo-l), T;*=x*_^.„(0,jo-l)=x_,„.(0,jo), w;*=x*™-(0,jo). 
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Note that u is the missing values of Xj^ from the previous step and w is the missing value for 
the next step. In addition, the update of x™* does not depend on the previously imputed values. 
Therefore, u and w are independent conditional on v. Thus, Aj^^ is reduced to 



p*{u\v,w) f P*jr,{u\v) 

rp* {u,v,w)dudvdw = / log p* { 



log \ ^ ^° ' z.r"(dx— ). 

We further let i be the set of observations where Xj^^ is missing and observed. Use x™^'* to 

denote the missing Xj^'s of the subset i. Then 

robs 



P* (U\V,W) ^ , , , ^ l^- 



log „ / .1 . „..x P7o ^' w)dudvdw > / log , , -1^1 (dx 



that is the joint K-L divergence is bounded from below by the marginal of K-L divergence on 
the subset i. Note that x*™*(0,jo — 1) is the starting value of x* and was sampled from the 
conditional stationary distribution z^^°''^(x™j*|x_jg). Equivalently, x*™'^(0, jo — 1) is sampled from 
/(xtjulxi^^j,), 6) where 6 = + Op(n~^/^) is a posterior sample and ^i-jo is fully observed (by the 
construction of set l) . 

On the other hand, xj^?^'*(0, jq — 1) follows the stationary distribution of the iterative chain and is 
sampled from the previous step (step k—1) according to the conditional model hjQ{xj^\xi-jf^,6jQ, (pj^) 
where (Ojo^^jo) = i^jo^'-Pjo) + Op{n~^/^) is a draw from the posterior distribution. In addition, by 
assumption, the parameters are different by at least eo, that is, {ffjol + \6jo — %ol > ^o- Thus, 
the conditional distributions hjg{-\xi-j^,9j^,(pjQ) and f{-\xi-jQ,6) are different. For some Aq > 
(depending on eq), the KL divergence between the two updating distributions of Xij^ is bounded 
below by some Aq > 0, that is, for i G i 

^( /(-ki-jo'^) II ^io(-|^i,-jo'^io><^io) ) > -^0- 

This provides a lower bound of the KL divergence of one observation. The posterior predictive 
distributions for the observations in l are conditionally independent given {Oj^,ipj^). Thus, the KL 
divergence of the joint distributions is approximately the sum of the individual KL divergence of 
all the observations in l. Then, we obtain that for some Ai > 

A^^= [\og^^^^Mu,v,w)dudvdw > [ log ^^^["(^^ol^'io) x°^-(^^>n,..) (32) 
'° J PnMv,wf' -J ^dr.2^:i,(x-|x_,J 

> Ai#(0. 



Since the number of observations #(i) — )• oo (condition B5), we reached a contradiction to (31). 
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Thus, \(pj\ + \9j — 6j\ = 0(1) as n — )• 00. Thereby, we conclude Step 1. 

Step 2. We first show the consistency of 6'(^). It is sufficient to show that l^^^^ — ^| — ?> 0. ^^^^ 
solves equation 

j 51og/(x'"",x°''^|^(2))i,x°''-(dx"^i^) = 0. 
By Taylor expansion, the MLE has the representation that 

0^(2) _o = o{n-^) j alog/(x'"*^x°^" I ^)z^^°''(dx™"). 

Thus, it is sufficient to show that 

j aiog/(x--,x°^^|0>r''^(dx--) =Op(n). 

Given that 6 — 6^ = Op{n^^^'^). It is sufficient to show that 

J 91og/(x™^x°^"|e0)zy^°'''(dx™^) = o(n). (33) 

Notice that the observed data MLE 9 satisfies 

J 91og/(x™^x°*^|^)/(^ix'"^"|x°^^0^) =0 

and further 

J alog/(x™^x°^^|0O)/((ix™^lx°^^^) = Op(i). 

Then, we only need to show that 

y"alog/(x"^^^x°^^|0°) z/^°'''(dx™") -/(dx™^|x°*^^) =Op(n). (34) 

Consider a single observation x^^^^k). Without loss of generality, suppose that x'^'^^{k) = (xi(/c), Xj 
and x°^^ = (xj+i, Xp). The result of Step 1 suggests that each coordinate of x™'^ is updated 
from 

hj{xj\x-j,ej +Op{l),ipj = Op(l)). 

Thus, x™'^(/c) follows precisely the transition kernel of x*{k) described in condition B5. Therefore, 
we apply Lemma [2] and have that 



alog/(x™^x°''^|r)^/f (dx™^) 

j alog/(x™^x°^^|^°)/(x™^|x"^^0>x™'^ + 0(1). 



Then, (34) is satisfied immediately by adding up the above integral for all observations. Therefore, 



(33) is satisfied and further ^(2) _ ^ g. The proof for 6j and ipj are completely analogous and 
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therefore is omitted. Thereby, we conclude the proof. 



C Markov chain stability and rates of convergence 

In this section, we discuss the pending topic of the Markov chain's convergence. A bound on the 
convergence rate qk is required for both Lemma [2] and |3j In this section, we review strategies in 
existing literature to check the convergence. We first provide a brief summary of methods to control 
the rate of convergence via renewal theory. 

Markov chain stability by renewal theory. We first list a few conditions (cf. t4j), which we 
will refer to later. 

CI Minorization condition: A homogeneous Markov process W{n) with state space in X and 
transition kernel K{w, dw') = P(W{n + 1) E dw'\W{n) = w) is said to satisfy a minorization 
condition if for a subset C C X, there exists a probability measure u on X, I G Z+, and 
q G (0, 1] such that 

K'-'\w,A) > qi^{A) 
for all w & C and measurable A C X. C is called a small set. 

C2 Strong aperiodicity condition: There exists 5 > such that qv{C) > 6. 

C3 Geometric drift condition: there exists a non-negative and finite drift function, V and scalar 
A G (0, 1) such that for all w^C, 

\V{w)> j V{w')K{w,dw'), 

and for all wGC, f V{w')K{w, dw') < b. 

Chains satisfying Al-3 are ergodic and admit a unique stationary distribution 

1 " 

7r(-) = lim -V 



n 

1=1 



for all w. Moreover, there exists p < 1 depending only (and explicitly) on q, 6, A, and b such that 
whenever p < 7 < 1, there exists M < 00 depending only (and explicitly) on q, 5, A, and b such 
that 



sup 

\9\<V 



J g{w')K^''\w,dw') - J g{w')TTidw')\ < MV{w)-f'', (35) 



for all w and k > 0, where the supremum is taken over all measurable g satisfying g{w) < V{w) 
See [18j and more recently [Ij for a proof via the coupling of two Markov processes. 
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A practical alternative. In practice, one can check for convergence empirically. There are many 
diagnostic tools for the convergence of MCMC; see [7] and the associated discussion. Such empirical 
studies can show stability within the range of observed simulations. This can be important in that 
we would like our imputations to be coherent even if we cannot assure they are correct. In addition, 
most theoretical bounds are conservative in the sense that the chain usually converges much faster 
than what it is implied by the bounds. On the other hand, purely empirically checking supplies 
no theoretical guarantee that the chain converges to any distribution. Therefore, a theoretical 
development of the convergence is recommended when it is feasible given available resources (for 
instance, time constraint). 
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